{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Filtering and Annotation Tutorial\n",
    "\n",
    "### Filter\n",
    "\n",
    "You can filter the rows of a table with [Table.filter](https://hail.is/docs/0.2/hail.Table.html#hail.Table.filter).   This returns a table of those rows for which the expression evaluates to `True`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import hail as hl\n",
    "\n",
    "hl.utils.get_movie_lens('data/')\n",
    "users = hl.read_table('data/users.ht')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "users.filter(users.occupation == 'programmer').count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also express this query in multiple ways using [aggregations](https://hail.is/docs/0.2/tutorials/04-aggregation.html): "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "users.aggregate(hl.agg.filter(users.occupation == 'programmer', hl.agg.count()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "users.aggregate(hl.agg.counter(users.occupation == 'programmer'))[True]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Annotate\n",
    "\n",
    "You can add new fields to a table with [annotate](https://hail.is/docs/0.2/hail.Table.html#hail.Table.annotate).  As an example, let's create a new column called `cleaned_occupation` that replaces missing entries in the `occupation` field labeled as 'other' with 'none.'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "missing_occupations = hl.set(['other', 'none'])\n",
    "\n",
    "t = users.annotate(\n",
    "    cleaned_occupation=hl.if_else(missing_occupations.contains(users.occupation), hl.missing('str'), users.occupation)\n",
    ")\n",
    "t.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compare this to what we had before:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "users.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note: `annotate` is functional: it doesn't mutate `users`, but returns a new table.  This is also true of `filter`.  In fact, all operations in Hail are functional."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "users.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Select and Transmute\n",
    "There are two other annotate methods: [select](https://hail.is/docs/0.2/hail.Table.html#hail.Table.select) and [transmute](https://hail.is/docs/0.2/hail.Table.html#hail.Table.transmute).  `select` allows you to create new tables from old ones by selecting existing fields, or creating new ones."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's extract the `sex` and `occupation` fields:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "users.select(users.sex, users.occupation).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also create a new field that stores the age relative to the average. Note that new fields *must* be assigned a name (in this case `mean_shifted_age`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mean_age = round(users.aggregate(hl.agg.stats(users.age)).mean)\n",
    "users.select(users.sex, users.occupation, mean_shifted_age=users.age - mean_age).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`transmute` replaces any fields mentioned on the right-hand side with the new fields, but leaves unmentioned fields unchanged.  `transmute` is useful for transforming data into a new form. Compare the following two snippts of code. The second is identical to the first with `transmute` replacing `select.`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "missing_occupations = hl.set(['other', 'none'])\n",
    "\n",
    "t = users.select(\n",
    "    cleaned_occupation=hl.if_else(missing_occupations.contains(users.occupation), hl.missing('str'), users.occupation)\n",
    ")\n",
    "t.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "missing_occupations = hl.set(['other', 'none'])\n",
    "\n",
    "t = users.transmute(\n",
    "    cleaned_occupation=hl.if_else(missing_occupations.contains(users.occupation), hl.missing('str'), users.occupation)\n",
    ")\n",
    "t.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Global Fields"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, you can add global fields with [annotate_globals](https://hail.is/docs/0.2/hail.Table.html#hail.Table.annotate_globals).  Globals are useful for storing metadata about a dataset or storing small data structures like sets and maps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "t = users.annotate_globals(cohort=5, cloudable=hl.set(['sample1', 'sample10', 'sample15']))\n",
    "t.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "t.cloudable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hl.eval(t.cloudable)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercises\n",
    "\n",
    "\n",
    "- [Z-score normalize](https://en.wikipedia.org/wiki/Standard_score) the age field of `users`.\n",
    "- Convert `zip` to an integer.  Hint: Not all zipcodes are US zipcodes!  Use [hl.int32](https://hail.is/docs/0.2/functions/constructors.html#hail.expr.functions.int32) to convert a string to an integer.  Use [StringExpression.matches](https://hail.is/docs/0.2/expressions.html#hail.expr.expressions.StringExpression.matches) to see if a string matches a regular expression."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
